Learning Information Structure in the Prague Treebank
نویسنده
چکیده
This paper investigates the automatic identification of aspects of Information Structure (IS) in texts. The experiments use the Prague Dependency Treebank which is annotated with IS following the Praguian approach of Topic Focus Articulation. We automatically detect t(opic) and f(ocus), using node attributes from the treebank as basic features and derived features inspired by the annotation guidelines. We show the performance of C4.5, Bagging, and Ripper classifiers on several classes of instances such as nouns and pronouns, only nouns, only pronouns. A baseline system assigning always f(ocus) has an F-score of 42.5%. Our best system obtains 82.04%.
منابع مشابه
(Pre-)Annotation of Topic-Focus Articulation in Prague Czech-English Dependency Treebank
The objective of the present contribution is to give a survey of the annotation of information structure in the Czech part of the Prague Czech-English Dependency Treebank. We report on this first step in the process of building a parallel annotation of information structure in this corpus, and elaborate on the automatic pre-annotation procedure for the Czech part. The results of the pre-annotat...
متن کاملInformation Structure with the Prague Arabic Dependency Treebank
The issue of information structure in language has been studied extensively both in the Prague School of Linguistics (Mathesius, 1929) and in the Functional Generative Description (FGD), one of the modern theories of representation of linguistic meaning (Sgall, 1967; Sgall et al., 1986; Hajičová and Sgall, 2003, 2004). In its entirety, FGD constitutes the framework for a family of projects in c...
متن کاملLearning to Search in Prague Dependency Treebank
We present Netgraph – an easy to use tool for searching in linguistically annotated treebanks. On several examples from the Prague Dependency Treebank we introduce the features of the searching language and show how to search for some frequent linguistic phenomena.
متن کاملLearning Verb Subcategorization from Corpora: Counting Frame Subsets
We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents...
متن کاملAutomatic Extraction of Subcategorization Frames for Czech
We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents...
متن کامل